 visual prompt


Appendix Implementation Details

Neural Information Processing Systems

A.1 Network Architectures

We adopt DAFormer [17] with a Swin-B or MiT-B5 backbone as the base semantic segmentation architecture. For the segmentation head, we use the same head as DAFormer [17]. The stem module contains one fully convolutional layer with a 3 × 3 kernel and a stride of 2, two fully convolutional layers with 3 × 3 kernels and a stride of 1, two fully convolutional layers with 3 × 3 kernels and a stride of 2, and three further fully convolutional layers with 1 × 1 kernels and a stride of 1 that adjust the channels of the different feature maps. The level embedding module is defined as a matrix of shape 3 × dims. The prompt interactor module contains three fully convolutional layers with 3 × 3 kernels and a stride of 2 to adjust feature dimensions.
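To make the stem's downsampling concrete, the following is a minimal sketch that traces the spatial size of a feature map through the five 3 × 3 layers listed above. It rests on assumptions the text does not state: the layers are applied sequentially in the order given, each uses a padding of 1, and the input is square. The names `Conv`, `conv_out`, `STEM`, and `stem_sizes` are hypothetical and not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conv:
    kernel: int
    stride: int
    padding: int  # assumed value; the text does not specify padding


def conv_out(size: int, layer: Conv) -> int:
    """Standard convolution output-size formula for one spatial dimension."""
    return (size + 2 * layer.padding - layer.kernel) // layer.stride + 1


# Assumed sequential ordering of the 3x3 stem layers described in the text:
# one stride-2 layer, two stride-1 layers, then two stride-2 layers.
STEM = [
    Conv(kernel=3, stride=2, padding=1),
    Conv(kernel=3, stride=1, padding=1),
    Conv(kernel=3, stride=1, padding=1),
    Conv(kernel=3, stride=2, padding=1),
    Conv(kernel=3, stride=2, padding=1),
]


def stem_sizes(size: int) -> list[int]:
    """Return the spatial size after each stem layer for a square input."""
    sizes = []
    for layer in STEM:
        size = conv_out(size, layer)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    # A 512-pixel input ends at 1/8 resolution under these assumptions.
    print(stem_sizes(512))  # [256, 256, 256, 128, 64]
```

Under these assumptions the stem yields feature maps at 1/2, 1/4, and 1/8 of the input resolution. The 1 × 1 layers with a stride of 1 leave the spatial size unchanged and only alter the channel count.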



OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Neural Information Processing Systems

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction.


CAT: Coordinating Anatomical-Textual Prompts for Multi-Organ and Tumor Segmentation

Neural Information Processing Systems

Existing promptable segmentation methods in the medical imaging field primarily consider either textual or visual prompts to segment relevant objects, yet they often fall short when addressing anomalies in medical images, like tumors, which may vary greatly in shape, size, and appearance. Recognizing the complexity of medical scenarios and the limitations of textual or visual prompts, we propose a novel dual-prompt schema that leverages the complementary strengths of visual and textual prompts for segmenting various organs and tumors. Specifically, we introduce CAT, an innovative model that Coordinates Anatomical prompts derived from 3D cropped images with Textual prompts enriched by medical domain knowledge. The model architecture adopts a general query-based design, where prompt queries facilitate segmentation queries for mask prediction. To synergize two types of prompts within a unified framework, we implement a ShareRefiner, which refines both segmentation and prompt queries while disentangling the two types of prompts. Trained on a consortium of 10 public CT datasets, CAT demonstrates superior performance in multiple segmentation tasks. Further validation on a specialized in-house dataset reveals the remarkable capacity of segmenting tumors across multiple cancer stages. This approach confirms that coordinating multimodal prompts is a promising avenue for addressing complex scenarios in the medical domain.